Clustering Methods in Bioinformatics
Hierarchical Clustering
Learning objectives:
By the end of this tutorial, you should be able to explain the difference between Agglomerative and Divisive Hierarchical Clustering, describe how distance metrics and linkage criteria are used to build a hierarchy of clusters, implement Agglomerative Hierarchical Clustering in Python, evaluate the result with the silhouette score, and visualize the hierarchy with a dendrogram.
Introduction
Welcome to our tutorial on Hierarchical Clustering! Hierarchical Clustering is a method of clustering data points into groups (also known as clusters) based on their similarity. It builds a hierarchy of clusters, either by repeatedly merging the most similar clusters (bottom-up) or by repeatedly splitting clusters (top-down), until every data point has been placed in the hierarchy.
Hierarchical Clustering is an essential technique in data analysis and is widely used in various fields, including image processing, bioinformatics, and social sciences.
In this tutorial, we will discuss the different types of Hierarchical Clustering algorithms, how they work, and how to implement them in Python.
After you have completed this tutorial, you will have a solid understanding of Hierarchical Clustering and be able to use it to cluster data points in your projects. Let's get started!
What is Hierarchical Clustering?
Two main types of Hierarchical Clustering algorithms exist: Agglomerative and Divisive. Agglomerative Hierarchical Clustering begins by treating each data point as its own cluster and then merging the most similar clusters until all data points are included in a single cluster. The process creates a hierarchy of clusters; the data points are at the bottom, and the final cluster is at the top.
Divisive Hierarchical Clustering, on the other hand, starts with all data points in a single cluster and then divides the cluster into smaller clusters until each data point is in its own cluster.
Agglomerative and Divisive Hierarchical Clustering algorithms require a measure of similarity between data points, and we can calculate this using a distance metric, such as Euclidean distance or cosine similarity. The choice of distance metric will depend on the data's nature and the application's specific requirements.
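As a quick illustration, the following minimal sketch compares the two metrics on a pair of hypothetical feature vectors; it uses SciPy's distance helpers, though any implementation of these formulas would do.
import numpy as np
from scipy.spatial.distance import euclidean, cosine

# two hypothetical feature vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(euclidean(a, b))  # straight-line distance: about 3.74
print(cosine(a, b))     # cosine distance = 1 - cosine similarity: approximately 0.0, since b points in the same direction as a
Here the two points look far apart under Euclidean distance but identical under cosine distance, which is exactly why the choice of metric matters.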
There are several advantages to using Hierarchical Clustering. One of the primary advantages is that it does not require the user to specify the number of clusters in advance, which is useful when the number of clusters is unknown or difficult to determine.
Hierarchical Clustering also produces a hierarchy of clusters, which can help understand the structure of the data and visualize the relationships between data points.
This tutorial will discuss how to implement Hierarchical Clustering in Python using the scikit-learn library. We will walk through the process of preparing the data, calculating the similarity between data points, and creating the hierarchy of clusters.
We will also discuss how to evaluate the results of the clustering and how to visualize the clusters using dendrograms.
Overview of Hierarchical Clustering Implementation
To start off, we will need to prepare our data for clustering by cleaning and preprocessing the data and selecting the relevant features for clustering. It is essential to ensure that the data is in a format that the clustering algorithm can efficiently process.
Once the data is prepared, we can begin the clustering process. We will start by calculating the similarity between data points using a distance metric. Several distance metrics can be used, such as Euclidean distance, cosine similarity, and Manhattan distance. The choice of distance metric will depend on the nature of the data and the specific requirements of the application.
Once the similarity between data points has been calculated, we can begin creating the hierarchy of clusters. In Agglomerative Hierarchical Clustering, we start with each data point as its own cluster, repeatedly merge the two most similar clusters, and stop when all data points have been merged into a single cluster.
In Divisive Hierarchical Clustering, we start with all data points in a single cluster and then divide the cluster into smaller clusters until each data point is in its own cluster.
It is crucial to evaluate the clustering results to ensure that the clusters are meaningful and accurate. Several metrics can be used to assess the performance of a clustering algorithm, such as the silhouette score and the Calinski-Harabasz index.
Finally, we can visualize the clusters using a dendrogram, a tree-like diagram showing the hierarchy of clusters. A dendrogram can be a valuable tool for understanding the data's structure and identifying patterns in the clusters.
Divisive Hierarchical Clustering Algorithm
given a dataset (d1, d2, d3, ..., dN) of size N
at the top, all data points are in one cluster
the cluster is split using a flat clustering method, e.g., K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
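Below is a minimal Python sketch of this top-down procedure. It assumes scikit-learn's KMeans as the flat clustering step and uses within-cluster variance as a simple, hypothetical rule for picking the "best" cluster to split next; a real implementation might use a different criterion or stopping rule.
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, max_clusters=None):
    """Top-down clustering: repeatedly split the highest-variance cluster with 2-means."""
    max_clusters = max_clusters or len(X)
    clusters = [np.arange(len(X))]                 # start with all data points in one cluster
    while len(clusters) < max_clusters:
        # pick the cluster with the largest within-cluster variance to split next
        variances = [X[idx].var() if len(idx) > 1 else 0.0 for idx in clusters]
        if max(variances) == 0.0:                  # only singleton (or constant) clusters remain
            break
        idx = clusters.pop(int(np.argmax(variances)))
        # split the chosen cluster with a flat clustering method (2-means here)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

# usage example: split 20 random 2-D points into 4 clusters
X = np.random.RandomState(0).rand(20, 2)
print([len(c) for c in divisive_clustering(X, max_clusters=4)])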
Agglomerative Clustering algorithm
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
# (the matrix is symmetric about the primary diagonal,
#  so we only compute the part below the diagonal)
for i = 1 to N:
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point starts as a singleton cluster
repeat
    merge the two clusters with the minimum distance
    update the distance matrix
until only a single cluster remains
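For illustration, here is a naive NumPy sketch of this bottom-up procedure. It assumes Euclidean distance and single linkage (the distance between two clusters is taken as the smallest distance between any pair of their points), and it records the sequence of merges rather than returning flat cluster labels.
import numpy as np

def agglomerative_clustering(X):
    """Naive bottom-up clustering: merge the closest pair of clusters until one remains."""
    n = len(X)
    # pairwise distance matrix between all data points (symmetric; computed in full here for simplicity)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = {i: [i] for i in range(n)}   # each data point starts as a singleton cluster
    merges = []
    while len(clusters) > 1:
        # find the two clusters with the minimum distance
        # (single linkage: smallest point-to-point distance between the clusters)
        keys = list(clusters)
        best_pair, best_d = None, np.inf
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best_d:
                    best_pair, best_d = (a, b), d
        a, b = best_pair
        merges.append((a, b, best_d))
        clusters[a] = clusters[a] + clusters.pop(b)   # merge cluster b into cluster a
    return merges

# small usage example on random 2-D data
X = np.random.RandomState(0).rand(6, 2)
for a, b, d in agglomerative_clustering(X):
    print(f"merged clusters {a} and {b} at distance {d:.3f}")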
Comparison of Divisive and Agglomerative Clustering Algorithms
Agglomerative clustering works bottom-up, merging the closest pair of clusters at each step, whereas divisive clustering works top-down, repeatedly splitting a cluster with a flat clustering method such as K-Means. Agglomerative clustering is by far the more commonly used of the two and is what most libraries implement; divisive clustering captures the coarse, global structure of the data first, but its quality depends on the flat clustering used at each split.
Computing The Distance Between Clusters
Once clusters contain more than one data point, we also need a rule, called a linkage criterion, for measuring the distance between two clusters. Common choices are single linkage (the minimum distance between any pair of points in the two clusters), complete linkage (the maximum such distance), average linkage (the mean of all pairwise distances), and Ward's method (the merge that causes the smallest increase in total within-cluster variance). Together with the distance metric, the linkage criterion determines the shape of the resulting hierarchy.
The following section walks through this in Python: preparing the data, calculating the distances between data points, building the hierarchy of clusters, evaluating the clustering results, and visualizing the clusters with a dendrogram.
Implementing Hierarchical Clustering in Python
To implement Hierarchical Clustering in Python, we will use the scikit-learn and SciPy libraries. Scikit-learn is a popular machine learning library that provides the dataset, preprocessing, and evaluation tools we need, as well as an Agglomerative Hierarchical Clustering estimator; SciPy provides the routines for building and inspecting the hierarchy itself. Divisive Hierarchical Clustering is not provided directly by either library and is typically implemented on top of a flat clustering method such as K-Means.
We will start by importing the necessary libraries and loading the data. We will use the "load_iris" function from scikit-learn to load the Iris dataset, a well-known dataset in machine learning. The Iris dataset consists of 150 data points, each with four features: sepal length, sepal width, petal length, and petal width. The goal is to cluster the data points into groups based on their similarity.
Next, we will preprocess the data by standardizing the features. Standardization is a common preprocessing step in machine learning that scales the data to have zero mean and unit variance. Standardization is essential for algorithms that rely on distances between data points, ensuring that all features are on the same scale. We will use the "StandardScaler" from scikit-learn to standardize the features.
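A minimal sketch of these first two steps (the variable names X and X_scaled are our own choice):
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# load the Iris dataset: 150 samples, 4 features each
iris = load_iris()
X = iris.data

# scale each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0).round(2), X_scaled.std(axis=0).round(2))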
Once the data is preprocessed, we can calculate the similarity between data points using a distance metric. We will use the "pdist" function from SciPy (scipy.spatial.distance) to calculate the pairwise distances between all data points, with Euclidean distance as the distance metric.
Next, we will create the hierarchy of clusters using the "linkage" function from SciPy (scipy.cluster.hierarchy). The "linkage" function performs Agglomerative Hierarchical Clustering, starting from the bottom of the hierarchy and merging the most similar clusters until all data points are included in a single cluster. We will use the "ward" method to merge clusters, a variance-based strategy that aims to minimize the total within-cluster variance.
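Continuing the sketch, assuming the X_scaled array from the previous step:
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# condensed vector of pairwise Euclidean distances between all samples
distances = pdist(X_scaled, metric='euclidean')

# build the hierarchy bottom-up with Ward's method
Z = linkage(distances, method='ward')
print(Z.shape)  # (n_samples - 1, 4): one row per merge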
Once the hierarchy of clusters is created, we can use the "fcluster" function from SciPy (scipy.cluster.hierarchy) to extract flat clusters from the hierarchy and assign each data point to a cluster. We will use the "maxclust" criterion to specify the maximum number of clusters to extract.
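For example, to cut the hierarchy into at most three flat clusters (three is our own choice here, simply because the Iris dataset contains three species):
from scipy.cluster.hierarchy import fcluster

# assign each sample to one of at most 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels[:10])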
Finally, we can evaluate the clustering results using the silhouette score, which measures the compactness and separation of the clusters. A high silhouette score indicates that the clusters are well-defined and distinct. We will use the "silhouette_score" function from scikit-learn to calculate the silhouette score.
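A brief sketch of the evaluation step, using the scaled features and the flat cluster labels from above:
from sklearn.metrics import silhouette_score

# silhouette ranges from -1 to 1; higher means more compact, better-separated clusters
score = silhouette_score(X_scaled, labels, metric='euclidean')
print(f"silhouette score: {score:.3f}")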
We can also visualize the clusters using a dendrogram. A dendrogram is a tree-like diagram that shows the hierarchy of clusters. We will use the "dendrogram" function from SciPy (scipy.cluster.hierarchy), together with matplotlib, to plot it.
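Finally, a minimal plotting sketch; truncating the tree to the last 20 merges is our own choice, purely to keep the figure readable.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode='lastp', p=20)   # show only the last 20 merges
plt.xlabel("cluster (sample count shown in parentheses)")
plt.ylabel("merge distance")
plt.show()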